Helga Sigríður Thordersen Magnúsdóttir s202027 
                                                                             Hlynur Árni Sigurjónsson s192302
                                                                         Katrín Erla Bergsveinsdóttir s202026
                                                                            Kristín Björk Lilliendahl s192296

purple-divider

Advanced Business Analytics - Project

Tripadvisor - Copenhagen restaurant reviews


The idea is to scrape Tripadvisor reviews of restaurants in Copenhagen.

We all know the struggle of going out to eat when no one knows where to go...



We wanted to analyse restaurants in Copenhagen, the capital of Denmark. As foreign students from Iceland, we often have to look at Tripadvisor to find out where to eat. Moreover, we thought it would be an interesting dataset to analyse from a business perspective, which is discussed later in Chapter 4, Business questions.

We wanted to gain information about the review text, rating and time of review. Furthermore, we wanted to gather the restaurants' information and basic reviewer information.

The information gained will be

Our thoughts of what can be done with the data

Contents


green-divider

Imported data 🐼

The first step as always is to install and import the necessary packages.

purple-divider

1. Scraper info

**It is not necessary to run the scraper code. It was added for reference to show what was done in the scraper.** \ A scraper tool was created to gather the reviews and the restaurants' information into csv files. The skeleton of the tool was taken from a GitHub page by LaskasP. It later turned out that the code had many errors and crashed after a few calls, so the scraper tool was fixed and improved, and the information gathered was enriched. In addition, a scraper for reviewer information was created. Since the data is large and the tool was expected to encounter some errors along the way, the url links are stored in a csv. In our case, we chose to look into restaurants in the Copenhagen area, with over 1900 restaurants available. Each restaurant has on average 700 reviews, so the scraping time quickly adds up.

The scraper is based on the beautifulsoup package and selenium. Selenium is used to open pages and click on elements, for example to retrieve the next page or additional information.

Selenium actions:

Procedure:

As can be seen in the three code snippets below, there are a lot of "try: except:" clauses in the code. This is due to many small deviations on the Tripadvisor webpage: data can be missing for some restaurants, so the scraper tries to retrieve it and, if unsuccessful, leaves the field empty.

Scrape all restaurants urls

With the help of selenium, the "next page" button is pushed until the end is reached. Every restaurant url that is not "sponsored" in the listing is saved. The sponsored restaurants appear many times, often the same ones; if they were not skipped, the data would contain many duplicates and, what is worse, the scraping time would grow by a big margin.
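A rough sketch of the sponsored filtering described above, assuming hypothetical class names ("restaurant-card", "sponsored-label") since Tripadvisor's real markup changes frequently; the selenium "next page" clicking is omitted here:

```python
from bs4 import BeautifulSoup

BASE = "https://www.tripadvisor.com"

def collect_restaurant_urls(page_html):
    """Collect restaurant links from one listing page, skipping sponsored cards.

    The class names "restaurant-card" and "sponsored-label" are illustrative
    assumptions, not Tripadvisor's real markup.
    """
    urls = []
    soup = BeautifulSoup(page_html, "html.parser")
    for card in soup.find_all("div", class_="restaurant-card"):
        if card.find(class_="sponsored-label"):
            continue  # sponsored cards repeat the same few restaurants
        link = card.find("a", href=True)
        if link:
            urls.append(BASE + link["href"])
    return urls
```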

Scrape the restaurants information

The restaurants' information is scraped. Here the most applicable data was retrieved and stored in a separate csv file; the beautifulsoup package was sufficient to retrieve the data needed. Again we see many "try: except:" clauses in the code, since some information is missing for many of the restaurants.
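A minimal sketch of the per-field "try: except:" pattern, with illustrative tags rather than Tripadvisor's real markup:

```python
from bs4 import BeautifulSoup

def parse_restaurant_info(page_html):
    """Parse one restaurant page; fields missing from the page stay None.

    The tag choices ("h1" for the name, "address" for the address) are
    illustrative assumptions for this sketch.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    info = {}
    for field, tag in [("storeName", "h1"), ("storeAddress", "address")]:
        try:
            info[field] = soup.find(tag).get_text(strip=True)
        except AttributeError:
            info[field] = None  # tag not present for this restaurant
    return info
```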

Get the reviewers information

Here selenium came to the rescue, as it was necessary to click buttons on the reviewers' own pages. The information stored here is mainly gathered in the hope of capturing the connection between reviewers and restaurants. Tripadvisor has a community of reviewers who can follow each other, as on social platforms. That information is gathered along with the total number of reviews and "upvotes" the reviewer gives. The hope is to shed light on the influence of specific reviewers and the value they could add to restaurants; detecting bad or fraudulent reviews will hopefully be possible with the data at hand. The most frequently available data is the location and the join date of the reviewer. This information is quite important, since a network can be created based on those attributes.

Initialise files and get the data

The data is gathered by first getting all urls, then looping through them and scraping the information. First, initialise the file and the relevant column names.
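Initialising the output csv could look roughly like this; the column set is an assumption for illustration:

```python
import csv

# Assumed column set for the reviews file; the real notebook defines its own.
REVIEW_COLUMNS = ["storeName", "reviewerUsername", "rating",
                  "ratingDate", "reviewText"]

def init_csv(path, columns=REVIEW_COLUMNS):
    """Create (or overwrite) the file with only a header row, so the
    scraper can append review rows as it goes."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(columns)
```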

Getting the reviews

Scrape all of the reviews and keep track of which urls are finished and which are left.

Remove the urls that were successful. Take a look at those that failed and try to fix whatever was missing or caused the bad retrieval.

Getting restaurant information

Here we run a function that fetches the website with a request; the html parser from beautifulsoup is then used, and the needed tags are located and stored.

Here, bad_url contained restaurants with no reviews, which were deemed redundant to our dataset.

Getting reviewer information

Scraping the reviewers' information after the file has been created from the dataframe.

Merging csv files

In order to gather the data in a shorter time, multiple environments were run in parallel to gather the data simultaneously.
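A sketch of the merge step, assuming the partial files produced by the parallel runs share the same columns:

```python
import pandas as pd

def merge_csv_files(paths, out_path):
    """Concatenate the partial csv files from the parallel scraper runs
    and drop exact duplicate rows (overlapping url lists scrape twice)."""
    merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    merged = merged.drop_duplicates().reset_index(drop=True)
    merged.to_csv(out_path, index=False)
    return merged
```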

purple-divider

2. The datasets and data preparation

As mentioned in the scraper section, the tool was used to go iteratively through every "next" page of a specific Tripadvisor restaurant's reviews and gather the wanted information. The restaurants' information is also scraped separately.

The data is saved into two files:

  1. restaurantInfo.csv: Contains information about each restaurant.
  2. reviews.csv: Contains the reviews for each restaurant.

We start by importing the raw data files, cleaning and preparing them for further analysis.

Additionally, another dataset was downloaded from the internet. It included shapefiles of Denmark with municipality division.

green-divider

2.1 restaurantInfo.csv 🍕

As mentioned before, restaurantInfo.csv contains information about each restaurant. The information is as follows:

From the information above 👆🏼 we can see that three columns contain a list: CousineType, all_ranks and all_ratings. If we read the csv file without doing anything, these columns would appear as strings. In order to prevent that and import them as lists, we need to use converters when reading the file.
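One way to do this (a sketch, not necessarily the notebook's exact code) is to pass ast.literal_eval as the converter for each list-valued column, turning the stored string "['Danish', 'European']" back into a real Python list:

```python
import ast
import pandas as pd

LIST_COLUMNS = ["CousineType", "all_ranks", "all_ratings"]

def read_restaurant_info(path_or_buffer):
    """Read restaurantInfo.csv, converting the stringified list columns
    back into real Python lists with ast.literal_eval."""
    return pd.read_csv(
        path_or_buffer,
        converters={col: ast.literal_eval for col in LIST_COLUMNS},
    )
```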

Let's examine how the data looks by loading the restaurantInfo.csv into a pandas dataframe.

By examining the dataframe we can see that the priceCategory column appears to be displaying in a weird way or not showing all the data.

Let's check whether it is not imported correctly into the dataframe, or the view just cannot display it 🤔

By looking at the first row, we can see that Maple Casual Dining has the correct information in priceCategory, so the information was imported correctly into the dataframe ✅

Data Cleaning 🧹

In order to work further with the dataset, we need to clean the dataframe. Since we now know that priceCategory is imported correctly, we can start cleaning the column. Since '$$' is a keyword in Matplotlib, the column has to be mapped to something else. A mapping to integers was created, since they are a convenient way to represent the data.
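A minimal sketch of such a mapping; the exact price labels are assumptions and should be checked against the scraped data:

```python
# Assumed Tripadvisor price labels; verify against the labels actually
# present in the priceCategory column before applying.
PRICE_MAP = {"$": 1, "$$ - $$$": 2, "$$$$": 3}

def map_price(label):
    """Map a dollar-sign price label to an integer (None for missing or
    unknown labels), avoiding '$...$' strings that Matplotlib treats as
    math text."""
    return PRICE_MAP.get(label)
```

With pandas this would be applied as, e.g., `restaurants["priceCategory"].map(PRICE_MAP)`.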

Additional Dataset 🗺

Since we were able to obtain the address for each restaurant it would be nice to be able to plot the restaurants, to see the locations and use it for further analysis. In order to do so, we need to get the latitude and longitude coordinates of the restaurant address.

This can be achieved by using the address as a query string and calling the Open Street Map search API. Since some of the addresses are in a strange format, the name of the restaurant will first be used in the search query (with 'denmark' appended to narrow the search). If the location can be found from the restaurant name, that is used; if not, the address is used to generate the latitude and longitude. **Please note that it takes around 30 minutes to run the code below** 👇🏼
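The lookup could be sketched like this, using Open Street Map's public Nominatim search endpoint; the helper names and User-Agent string are illustrative assumptions:

```python
def build_queries(name, address):
    """Queries tried in order: restaurant name (with 'denmark' appended to
    narrow the search), then the raw address as a fallback."""
    return [f"{name} denmark", address]

def geocode(query):
    """Look up a query on Nominatim and return (lat, lon), or (None, None)
    if nothing is found. Network call; Nominatim asks for a descriptive
    User-Agent and throttled requests."""
    import requests  # deferred so the helpers above work without it
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": query, "format": "json", "limit": 1},
        headers={"User-Agent": "restaurant-review-analysis"},
        timeout=10,
    )
    results = resp.json()
    if results:
        return float(results[0]["lat"]), float(results[0]["lon"])
    return None, None

if __name__ == "__main__":
    # Name first, address only if the name found nothing.
    for query in build_queries("Maple Casual Dining",
                               "Vesterbrogade 24, Copenhagen"):
        lat, lon = geocode(query)
        if lat is not None:
            break
```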

💡 Now there should be two new columns in the dataframe presenting the latitude and longitude for each restaurant.

Correct! We can see that 'Maple Casual Dining' is located at Vesterbrogade 24, coordinates (55.67, 12.558) 🌍

green-divider

2.2 reviews.csv ⭐️

Next up is looking into reviews.csv . It contains information about each review for every restaurant. Namely:

Let's import the reviews to a new dataframe:

The dataframe looks good; however, we need to make sure that ratingDate is presented correctly, that is, as a date:

We can see that ratingDate is a string. For further analysis it will be beneficial to convert this column into a timestamp object, which is done using datetime:
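The conversion itself is essentially a one-liner with pandas; the toy frame and date strings below are illustrative, not the real scraped data:

```python
import pandas as pd

# Illustrative date strings; the scraped format may differ slightly.
reviews = pd.DataFrame({"ratingDate": ["12 March 2020", "1 January 2019"]})

# pd.to_datetime parses the strings into proper timestamps.
reviews["ratingDate"] = pd.to_datetime(reviews["ratingDate"])
```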

Great! ratingDate is now in the correct format and we are able to work with the dataframe 💯

green-divider

2.3 Shapefiles 🇩🇰

An interesting thing to look into is the different areas where the restaurants are located. One way of doing that is to use shapefiles.

The shapefile format is a digital vector storage format for storing geometric location and associated attribute information.

Shapefiles for the whole of Denmark were downloaded from DIVA-GIS. The files include information about the municipalities and their division within Denmark. These can be used later on when performing analysis that require information about different areas of Denmark.

The first step is to load and examine the dataset:

For simplification, all analyses performed will focus on the capital region. Therefore, the shapefile is filtered on Hovedstaden, and some of the included islands are skipped so the plot will be as clear as possible 🏝
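A sketch of the filtering, assuming the shapefile's attribute columns are named NAME_1 (region) and NAME_2 (municipality) and that the skipped islands include Bornholm; all of these are assumptions to verify against the actual DIVA-GIS file:

```python
import pandas as pd

# Municipalities dropped so remote islands don't stretch the plot;
# the list is an illustrative assumption.
SKIPPED_ISLANDS = ["Bornholm", "Christiansø"]

def filter_capital_region(gdf):
    """Keep only Hovedstaden municipalities, minus the skipped islands.
    Works on a GeoDataFrame (or any DataFrame with these columns)."""
    mask = (gdf["NAME_1"] == "Hovedstaden") & ~gdf["NAME_2"].isin(SKIPPED_ISLANDS)
    return gdf[mask]

if __name__ == "__main__":
    import geopandas as gpd
    denmark = gpd.read_file("DNK_adm2.shp")  # hypothetical filename
    capital = filter_capital_region(denmark)
    capital.plot(column="ID_2", cmap="tab20")  # one color per municipality ID
```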

Each municipality will be plotted in a different color based on their numerical ID in the shapefile.

purple-divider

3. Descriptive Statistics 🔎

Now that we have scraped and imported all the necessary data, we can check whether we have any outliers, look at summary statistics of the different columns in our dataset, and so on. This will be done using Pandas ProfileReport.

Pandas profiling saves the work of visualizing and understanding the distribution of each variable. It generates a report with all the information easily available 👌🏼

Let's begin to look at the Restaurants dataframe!

3.1. Restaurants 🍻

We now put the cleaned dataframe from Chapter 2.1 restaurantInfo.csv into the Pandas ProfileReport to generate a profile and display it:
**If nothing is visible in the output cell below, it is necessary to go to File -> Trust Notebook**

Firstly, we can see that the columns storeName, storeAddress, nrReviews and location have high cardinality, which refers to columns with values that are very uncommon or unique. That makes sense, because we are analysing different restaurants. Moreover, it is interesting to see that several restaurants appear a few times. These are, for example, places like Joe & The Juice, Sticks'n'Sushi and McDonald's, which are franchises all over Denmark. We can see that these are not duplicates, as seen in the code below:

We can see above 👆🏼 that all of the Joe & The Juice restaurants have different addresses. \ Secondly, by looking at the avgRate column, we can see that the minimum value is 1 and the maximum value is 5, which tells us that there are no strange values in that column. Moreover, we see that the average rating is around 3.97. Perhaps satisfied customers are more likely to leave a review - or maybe Copenhagen just has great restaurants! 🥳 This could be interesting to look into later on.

In the column storeAddress, we see that there are 6 restaurants with the same address, which is Vesterbrogade 3, Copenhagen 1630 Denmark. We wanted to look further into it:

By Google-ing the address we can see that these restaurants above are located in a food hall, and therefore it makes sense that they have the same address! 🏠🎡🎠🎢

We see that the column priceCategory has 579 (24.7%) missing values. However, this won't make a huge difference for our analyses later on, as it will not be the main variable we look at.

From the profiler it can be seen that it does not handle columns containing lists. Therefore, we do an additional check for the CousineType column. The reference code can be found here.
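Such a check can be sketched by exploding the list-valued column and counting values (a sketch, not necessarily the referenced code):

```python
import pandas as pd

def cuisine_counts(restaurants):
    """Count how many restaurants offer each cuisine by exploding the
    list-valued CousineType column into one row per cuisine."""
    return restaurants["CousineType"].explode().value_counts()
```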

From the plot above we can see that most restaurants offer European dishes 🇪🇺, followed by Danish cuisine 🇩🇰. Austrian dishes are the hardest to find in restaurants in Copenhagen 🇦🇹.

When looking at the profile for latitude and longitude, we clearly see some strange outliers in our dataframe. The average coordinates of Copenhagen are 55.6761 N latitude and 12.5683 E longitude. However, the minimum value in the latitude column is -35.008636, and the minimum in the longitude column is -87.83865855. The maximum value in the longitude column, 121.0348037, is also strange.

Let's plot the latitude and longitude of our dataframe to find the outlier(s). First we need to change the type of the latitude and longitude columns, as they are currently strings.
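The cast is straightforward with astype; a small sketch:

```python
import pandas as pd

def cast_coordinates(df):
    """Cast the scraped latitude/longitude strings to floats so they can
    be plotted and compared numerically."""
    df = df.copy()
    df["latitude"] = df["latitude"].astype(float)
    df["longitude"] = df["longitude"].astype(float)
    return df
```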

If we look up the restaurant that gives us the latitude value -35.008636, we get Café Ocean, and the coordinates point to a restaurant with the same name in the UK. The reason is that Café Ocean is permanently closed in Denmark ❌, so the code pulls the coordinates for the one in the UK.

Looking at the addresses they are all located in Denmark, some are located in the Copenhagen area and some outside. This needs to be further analysed for the relevant business questions.

Let's quickly look into how many coordinates seem to be off in the latitude column:

👆🏼we see that we have around 36 restaurants that fall into that range and can be dropped if needed 😌
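The range check can be sketched with a simple boolean mask; the latitude bounds below are an illustrative assumption around Copenhagen's roughly 55.68 N:

```python
import pandas as pd

def off_latitude(df, low=54.0, high=58.0):
    """Return the rows whose latitude falls outside a plausible band
    around Copenhagen (bounds are an illustrative assumption)."""
    mask = (df["latitude"] < low) | (df["latitude"] > high)
    return df[mask]
```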

We want to see where the restaurants are located in Copenhagen, so we plotted them below. However, the plot only shows the top 350 restaurants in Copenhagen, because the map code could not handle more data; when running with more than 350, it would not display. The reference code can be found here.

By looking at the top 350 restaurants in our dataframe, we can see that we have several outliers. One of the restaurants is located in London, England. Moreover, numerous restaurants are located outside of Copenhagen, even though we were only focusing on the Copenhagen area 😒. However, we are not going to do anything about these outliers now, but we might later on when working on our business questions.

green-divider

3.2 Reviews 📝

Let's now do the same for the Reviews dataframe as we did for the Restaurants dataframe, that is, apply Pandas ProfileReport to it.

Firstly, we see that Kodbyens Fiskebar is the most frequently reviewed restaurant 🐠 \ Let's look at where this restaurant is ranked in Copenhagen:

We can see that Kodbyens Fiskebar is ranked number 12 of 82 seafood restaurants in Copenhagen and number 160 overall for restaurants in Copenhagen. The numerous reviews for this restaurant are clearly good! 🤩

Secondly, we see that the column reviewerUsername has high cardinality, with 37870 distinct values. This tells us that we have 37870 different users rating restaurants. The column is also well distributed, as the most frequent reviewer accounts for only 0.1% of all the reviews.

Finally, by looking at the ratingDate column, the number of reviews is increasing year by year, most likely due to the spread of smartphones and better internet, and because people are travelling more frequently. Interestingly, there are fewer reviews for the year 2020, which is obviously due to Covid. Let's plot it:

3.3 Miscellaneous 🙌🏼

It was decided to pickle all the dataframes, mainly because it takes a long time to run the latitude and longitude code. It was also done to save time by not having to run all the code again. Now we have the cleaned and prepared dataframes ready for our business questions! 🍔🍟
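The pickling step might look like this (the function and paths are illustrative):

```python
import pandas as pd

def pickle_dataframes(frames, directory="."):
    """Save each cleaned dataframe to a .pkl file so the slow geocoding
    step never has to be rerun. Pickling also preserves dtypes such as
    the timestamp column and the list-valued columns.

    frames: mapping of name -> DataFrame.
    """
    for name, df in frames.items():
        df.to_pickle(f"{directory}/{name}.pkl")
```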

The reference code can be found here.

purple-divider

4. Business Questions 👩🏽‍💻👨🏼‍💻

The business questions ideas for the project are the following:

  1. Can we predict up-and-coming neighbourhoods based on restaurant ratings or newly opened restaurants?
  2. Does the advertised cuisine type of a restaurant represent what the reviewers really like?
  3. Show the trend of restaurants and how the movement based on rating and restaurant availability has changed over time.
  4. Provide insight into what makes a good or bad review: what are the keywords, and can we filter them out to get a report of improvement points for the restaurant?
  5. Network analysis based on the connections between reviewers or restaurants.
  6. Where should you place your restaurant in the city, based on the surrounding restaurant types and cuisines? Is there space for a new pizza place, or are there simply too many? Are there only low-rated pizza places in the neighbourhood where your ambition could thrive?
  7. A recommendation system based on similar reviewers' interests or similar restaurant characteristics.
  8. Create a machine learning model to predict a review rating based on the review text.

The group is going to choose some of these ideas. However, they might change as the project goes along.

Hlynur/Kristín (1, 2, 3, 6)

Helga/Katrín (4, 8):

Business Question: based on text analytics of the reviews; for example, the most common words and which words are associated with good and bad reviews. Moreover, create a machine learning model to predict a review rating based on the review text.